Diversity measurements in the Longitudinal Metastasis Cohort

Libraries

We use lmerTest to perform the linear mixed models. It allows us to test if there is a significant change whilst controlling for different patients.

# Libraries
library(ggplot2)
library(cowplot)
theme_set(theme_cowplot())
library(lmerTest)
library(plyr)
library(reshape2)

Settings for the script

Functions

Here we define functions used to measure distance. Firstly, we define diff_state_dist which is our measure of divergence, the number genomic bins that are different between two samples divided by total number of bins are aberrant (!=2) in at least one sample.

The second function pga_diff_dist, measures the difference in PGA (as percentage points).

Preprocessing

Here we process the clinical data and the CNA calls. We read in the clinical data and then the calls. We also then choose to saturate calls of losses and gain thus that 0 and 1 losses are all treated as a loss and that all gains (3, 4, 5 etc.) are treated as being simply ‘gained’. This means that a change from CN 2 to 5 is treated as being identical to 2 to 1 and 3 to 1, for instance.

Calculate divergence

We then calculate divergence for all comparisons of samples and record the organ, timepoint, type of sample (primary and metastasis) and block id for each.

Mean single timepoint divergence

Next, we calculate the mean divergence in each single time interval for each patient and we report the median value. This is provide a value for the general amount of divergence measured.

## [1] "Median mean timepoint divergence: 0.446"

Within timepoint, same organ divergence

Here we plot the divergences observed within each single timepoint, within a single organ, to represent divergence in space.

Same timepoint, difference organ divergence

Here we plot the divergences observed within each single timepoint, but this time within different organs sampled at these timepoints, and this represents divergence in space between the samples across the body. There are very few examples of timepoints in which multiple organs were assessed. Generally in the cohort colorectal samples are sampled in the first timepoint and liver in latter timepoints.

Within organ divergence, primary and metastasis

Now we collapse this data and group the within timepoint, single organ divergences into primary and metastasis categories to investigate if metastatic deposits show a higher or lower level of within organ diversity than primary tumours. Here we can also see that we have much more data for liver metastases.

Mixed model of within organ divergence, primary and metastasis

Of course, to correctly investigate this we need to perform some statistics. Therefore here we create a mixed model (lmer) of Divergence ~ Metastasis where we use each patient as a random effect to control for the fact that patients may show different total levels of divergence.

## [1] "Mixed model (1) - Primary and Metastasis using patients as random intercepts"
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dist ~ Type_i + (1 | Patient)
##    Data: within_prim_mets
## 
## REML criterion at convergence: -274.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9544 -0.5994  0.0348  0.6337  3.2688 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Patient  (Intercept) 0.02009  0.1417  
##  Residual             0.02183  0.1478  
## Number of obs: 334, groups:  Patient, 22
## 
## Fixed effects:
##                   Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)        0.27857    0.04427  65.94358   6.293 2.87e-08 ***
## Type_iMetastasis   0.03917    0.03615 331.69030   1.083    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Type_Mtstss -0.656

## [1] 0.2699961
## [1] "Repeat of mixed model (1) but for PGA difference"
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pdiff ~ Type_i + (1 | Patient)
##    Data: within_prim_mets
## 
## REML criterion at convergence: -564.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8739 -0.5932 -0.1060  0.4776  3.2093 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Patient  (Intercept) 0.00425  0.06519 
##  Residual             0.00942  0.09706 
## Number of obs: 334, groups:  Patient, 22
## 
## Fixed effects:
##                   Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)      9.362e-02  2.538e-02 1.178e+02   3.689 0.000343 ***
## Type_iMetastasis 9.851e-03  2.335e-02 3.318e+02   0.422 0.673310    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Type_Mtstss -0.762

Compare divergence within colorectal and liver samples and between them

Now we would like to ask if there is a significant difference between divergence within liver and colorectal samples from the same patient and between them. For this we do not filter by timepoint. As we can see there are far fewer colorectal-colorectal comparisons compared to anything to the liver samples.

ANOVA for difference between intra- and inter- organ variation

So now we ask if any of the categories have a significantly different mean using a simple one-way ANOVA (aov).

## [1] "An ANOVA for comparing divergence in colorectal and liver organs..."
##                Df Sum Sq Mean Sq F value  Pr(>F)   
## Organ_Compare   2   0.43 0.21496   4.731 0.00913 **
## Residuals     637  28.94 0.04543                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "An ANOVA for comparing difference in PGA in colorectal and liver organs..."
##                Df Sum Sq Mean Sq F value Pr(>F)
## Organ_Compare   2  0.006 0.00316   0.193  0.825
## Residuals     637 10.453 0.01641

Divergence found within same timepoint and in different timepoints

Now we want to know if divergence comparisons within timepoints are different from those not in the same timepoint. We now plot this simple category.

To assess this we simply ask what is the effect on divergence if a comparison is made within the same timepoint or not. We again do a mixed model (lmer) using the patient as a random effect.

## [1] "Mixed model (2) - General assessment of whether within timepoint and across timepoint is different"
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dist ~ Same_tp + (1 | Patient)
##    Data: within_tps_across_tps
## 
## REML criterion at convergence: -575.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.06957 -0.65190  0.08715  0.71782  2.45416 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Patient  (Intercept) 0.01912  0.1383  
##  Residual             0.02383  0.1544  
## Number of obs: 721, groups:  Patient, 22
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   0.40189    0.03147  23.24270  12.771 5.43e-12 ***
## Same_tpTRUE  -0.07556    0.01192 704.18895  -6.341 4.07e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Same_tpTRUE -0.184

## [1] "Repeat of mixed model (2) but for PGA difference"
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pdiff ~ Same_tp + (1 | Patient)
##    Data: within_tps_across_tps
## 
## REML criterion at convergence: -1053.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3485 -0.6916 -0.1002  0.5236  3.0628 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Patient  (Intercept) 0.005582 0.07471 
##  Residual             0.012444 0.11155 
## Number of obs: 721, groups:  Patient, 22
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   0.13456    0.01773  23.85885   7.588 8.26e-08 ***
## Same_tpTRUE  -0.03143    0.00860 707.17480  -3.655 0.000276 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Same_tpTRUE -0.237

Divergences between timepoint, including time normalised

Now we are interested in those divergences between single intervals across the patients both as a total value and also normalised by time (divergence month-1).

Inter-timepoint divergence categorised by treatment status

Now we are interested in whether treatment affects these intervals, so we basically collapse the previous plot and plot divergence both normalised and not normalised by time in the off- and on- treatment categories.

Mixed model of non-time normalised divergence on and off treatment

Here we perform a mixed model to test if absolute differences in divergence are different between treatments, using patients as a random effect.

## [1] "Mixed model (3) - Divergence NOT normalised by time in treated and non-treated intervals using patients as random intercepts"
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dist ~ Treatment_prior_recent_tp + (1 | Patient)
##    Data: patient_dists_next_tp
## 
## REML criterion at convergence: -219.1
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27587 -0.69514  0.05507  0.71404  2.22127 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Patient  (Intercept) 0.02271  0.1507  
##  Residual             0.02167  0.1472  
## Number of obs: 280, groups:  Patient, 21
## 
## Fixed effects:
##                                 Estimate Std. Error         df t value
## (Intercept)                     0.402022   0.039065  29.352775  10.291
## Treatment_prior_recent_tpTRUE  -0.008758   0.030071 254.033221  -0.291
##                               Pr(>|t|)    
## (Intercept)                   2.99e-11 ***
## Treatment_prior_recent_tpTRUE    0.771    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Trtm___TRUE -0.447

## [1] "Repeat of mixed model (3) but for PGA difference"
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pdiff ~ Treatment_prior_recent_tp + (1 | Patient)
##    Data: patient_dists_next_tp
## 
## REML criterion at convergence: -389.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4419 -0.5835 -0.0925  0.3880  3.1436 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Patient  (Intercept) 0.008195 0.09052 
##  Residual             0.012037 0.10971 
## Number of obs: 280, groups:  Patient, 21
## 
## Fixed effects:
##                                 Estimate Std. Error         df t value
## (Intercept)                     0.130503   0.024930  30.457223   5.235
## Treatment_prior_recent_tpTRUE  -0.006681   0.021674 218.572965  -0.308
##                               Pr(>|t|)    
## (Intercept)                   1.15e-05 ***
## Treatment_prior_recent_tpTRUE    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## Trtm___TRUE -0.501

Average change in PGA across timepoints

Here we process the data to calculate percentage of the genome aberrated (PGA) and to compare from one timepoint to the next whether this goes up or down and record the difference. Here is PGA in one sample is 0.41 and in the next it is 0.43, we would report it as going up by 0.02 * 100 (2). We average this for each timepoint and report the median of the averages across the cohort.

## [1] "Median mean increase in PGA (+ / -): 11.3"

Mean PGA across time

## [1] "Percentage points change: 6.5"